# Free or Paid: Ratings from Google Play Store Users

## Background

Users download apps for many purposes. Given that paid services usually offer a more pleasant experience, while free apps are more accessible to everyone, how do users rate these two kinds of apps? More specifically, the following questions are of interest:

- How do the app ratings differ between paid and free apps in general?
- How are the differences distributed across different app categories?
- Are there any categories where the differences are statistically significant?

To explore answers to the above questions, I narrowed the context to the Google Play Store and conducted data analysis on the Kaggle dataset [`Google Play Store`](https://www.kaggle.com/lava18/google-play-store-apps/home).

## Acknowledgement

I would like to thank Google Play Store and [Lavanya Gupta](https://www.kaggle.com/lava18) for offering the wonderful dataset.
```python
# Import packages
import re

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from scipy.stats import mannwhitneyu

# Read dataframe and display data
df = pd.read_csv('googleplaystore.csv')
df.head(5)
```

```python
# Quick look at the raw data (each expression was its own notebook cell)
df.shape
df.columns
df.describe()
df.boxplot()
df.hist()
df.info()

# Missing values
df.isnull()
df.isnull().sum()

# Inspect ratings above the maximum of 5, then drop the offending record
df[df.Rating > 5]
df.drop([10472], inplace=True)
df[10470:10475]

# Re-plot after removing the bad record
df.boxplot()
df.hist()
```

```python
# Check duplicates
n_duplicated = df.duplicated(subset=['App']).sum()
print("There are {}/{} duplicated records.".format(n_duplicated, df.shape[0]))
df_no_dup = df.drop(df.index[df.App.duplicated()], axis=0)
print("{} records after dropping duplicates.".format(df_no_dup.shape[0]))

# Check and clean Type values; NaN handling is deferred to the next step
print(set(df_no_dup.Type))
print("Dropping alien Type value '0', {} record(s) removed".format(sum(df_no_dup.Type == '0')))
df_no_dup = df_no_dup.drop(df_no_dup.index[df_no_dup.Type == '0'], axis=0)

# Check and drop NaN values
print("NaN value statistics in each column")
print(df_no_dup.isnull().sum(axis=0), '\n')
df_no_dup = df_no_dup.dropna(subset=['Type'])
print("Records with NaN in column 'Type' are dropped, {} records left.".format(df_no_dup.shape[0]))

# Prepare rating dataframe
df_rating = df_no_dup.dropna(subset=['Rating'])
print("Cleaned dataframe for 'Rating' has {} records.".format(df_rating.shape[0]))

# We are interested in the columns Category, Rating and Type;
# drop the irrelevant columns from the Rating dataframe.
df_rating = df_rating.loc[:, ['Rating', 'Type', 'Category']]
```

```python
def plot_hist(df, col, bins=10):
    """Plot a histogram for a column."""
    plt.hist(df[col], bins=bins)
    plt.xlabel(col)
    plt.ylabel('counts')
    plt.title('Distribution of {}'.format(col))


def compute_app_types(df):
    """Given a dataframe, compute the number of free and paid apps respectively."""
    return sum(df.Type == "Free"), sum(df.Type == 'Paid')


def plot_app_types(df):
    """Plot app type distributions across categories."""
    vc_rating = df.Category.value_counts()
    cat_free_apps = []
    cat_paid_apps = []
    for cat in vc_rating.index:
        n_free, n_paid = compute_app_types(df.query("Category == '{}'".format(cat)))
        cat_free_apps.append(n_free)
        cat_paid_apps.append(n_paid)

    f, ax = plt.subplots(2, 1)
    ax[0].bar(range(1, len(cat_free_apps) + 1), cat_free_apps)
    ax[1].bar(range(1, len(cat_free_apps) + 1), cat_paid_apps)


def drop_categories(df):
    """Drop categories in which either app type has fewer than 10 instances."""
    vc_rating = df.Category.value_counts()
    cats_to_drop = []
    for cat in vc_rating.index:
        n_free, n_paid = compute_app_types(df.query("Category == '{}'".format(cat)))
        if n_free < 10 or n_paid < 10:
            cats_to_drop.append(cat)
    for cat in cats_to_drop:
        df.drop(df.query('Category == "{}"'.format(cat)).index, axis=0, inplace=True)
    print("Deleted categories: {}".format(cats_to_drop))
    return df
```

```python
# Describe Rating dataframe
plot_hist(df_rating, 'Rating')
df_rating.describe()
print("There are {} free and {} paid apps in the Rating dataframe".format(*compute_app_types(df_rating)))

# Explore the distributions of free and paid apps across different categories
plot_app_types(df_rating)

# Exclude categories with fewer than 10 apps of either type (Free or Paid);
# otherwise a category would contain too little data to generalize from.
df_rating = drop_categories(df_rating)
print("Cleaned Rating dataframe has {} datapoints".format(df_rating.shape[0]))
df_rating.describe()
```

### Q1 How do the app ratings differ between paid and free apps in general?

```python
def plot_target_by_group(df, target_col, group_col, figsize=(6, 4), title=""):
    """Plot the mean of a numeric target column grouped by a categorical column."""
    order = sorted(list(set(df[group_col])))
    # Select the target column before aggregating so non-numeric columns are not averaged
    stats = df.groupby(group_col)[target_col].mean()
    fig, ax = plt.subplots(figsize=figsize)
    sns.barplot(x=group_col, y=target_col, data=df, ax=ax, order=order).set_title(title)
    ax.set(ylim=(3.8, 4.5))
    return stats


stats = plot_target_by_group(df_rating, 'Rating', 'Type', title="Average Rating Grouped by App Type")
for i, s in zip(stats.index, stats):
    print("{} app has average {} {}".format(i, 'Rating', s))
mean_rating = df_rating.Rating.mean()
print("Mean rating: {}".format(mean_rating))
```

#### Interpretation

In general, Free apps, with an average rating of 4.16, are rated lower than Paid apps, with an average rating of 4.27. Note that the average rating for all apps is 4.17, so Free apps are rated slightly below average, while Paid apps are rated relatively higher than the average.
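The comparison above boils down to a grouped mean set against the overall mean. A minimal sketch on made-up ratings follows; the `toy` dataframe and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy ratings, invented for illustration only (not from the Kaggle dataset).
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Rating": [4.0, 4.2, 4.1, 4.4, 4.3],
})

group_means = toy.groupby("Type")["Rating"].mean()  # mean rating per app type
overall_mean = toy["Rating"].mean()                 # mean rating over all apps
above_average = group_means > overall_mean          # which types beat the overall mean

print(group_means)
print("Overall mean:", overall_mean)
print(above_average)
```

When one group is much larger than the other, the overall mean sits close to the larger group's mean, which is consistent with the overall average (4.17) being so near the Free average (4.16) in the real data.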
### Q2 How are the differences distributed across different app categories?

```python
paid_stats = plot_target_by_group(df_rating.query('Type == "Paid"'), 'Rating', 'Category', (16, 4),
                                  "(Paid App) Average Ratings by App Category")
free_stats = plot_target_by_group(df_rating.query('Type == "Free"'), 'Rating', 'Category', (16, 4),
                                  "(Free App) Average Ratings by App Category")

# Paid-minus-free difference of the average rating in each category
fig, ax = plt.subplots(figsize=(16, 4))
sorted_idx = sorted(paid_stats.index)
rating_diff = paid_stats[sorted_idx] - free_stats[sorted_idx]
sns.barplot(x=sorted_idx, y=rating_diff, ax=ax).set_title(
    "Difference of Ratings between Paid and Free Apps Across App Categories");
rating_diff
```

#### Interpretation

Although paid apps are generally rated higher than free apps, both overall and in most app categories, there are still some categories where free apps are rated higher. COMMUNICATION, FINANCE and PHOTOGRAPHY are three such categories. In the FINANCE category, free apps are on average rated almost 0.3 higher than paid apps, which is also the largest difference between app types across all categories.
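The per-category difference can also be computed in one step with a pivot table. The sketch below uses a small invented dataframe (`toy`, with made-up ratings) in place of `df_rating`; the column name `diff` is likewise just a label chosen for this example.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the Kaggle dataset).
toy = pd.DataFrame({
    "Category": ["FINANCE", "FINANCE", "FINANCE", "GAME", "GAME", "GAME"],
    "Type": ["Free", "Free", "Paid", "Free", "Paid", "Paid"],
    "Rating": [4.4, 4.2, 4.0, 4.1, 4.5, 4.3],
})

# One row per category, one column per app type, cells hold the mean rating.
means = toy.pivot_table(index="Category", columns="Type", values="Rating", aggfunc="mean")

# Positive values: paid apps rated higher; negative values: free apps rated higher.
means["diff"] = means["Paid"] - means["Free"]
free_favored = means.index[means["diff"] < 0].tolist()

print(means)
print("Categories where free apps are rated higher:", free_favored)
```

Keeping both means and their difference in one table makes it easy to sort by `diff` and spot the categories where the sign flips.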
### Q3 Are there any categories where the differences are statistically significant?

```python
def compute_utest(df):
    """Compute the Mann-Whitney rank test for paid and free app ratings."""
    paid_rating = df.query('Type == "Paid"')['Rating']
    free_rating = df.query('Type == "Free"')['Rating']
    # The alternative is stated explicitly because the default differs across SciPy versions
    return mannwhitneyu(paid_rating, free_rating, alternative='two-sided')


def cat_utest(df):
    """Compute the U test for each app category in turn."""
    cats = set(df.Category)
    res = []
    for cat in cats:
        stats, pval = compute_utest(df.query('Category == "{}"'.format(cat)))
        res.append({'Category': cat, 'u_statistics': stats, 'p_value': pval})
    return pd.DataFrame(res)


uval, pval = compute_utest(df_rating)
print("General utest result: pval {}, u {}".format(pval, uval))
df_utest = cat_utest(df_rating)
df_utest.loc[df_utest.p_value < .05]  # significant categories
```

```python
# Visualization
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

# Model
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
```

```python
# Share of each application category
name = df['Category'].value_counts().index
num = df['Category'].value_counts().values
fig = px.pie(data_frame=df, names=name, values=num,
             title='Pie chart of application categories', width=1200, height=600)
fig.update_traces(textposition='inside', textinfo='label+percent')
fig.show()
```

Of all applications, 18% are in the Family category, followed by Game (10.6%) and Tools (7.78%). Apps on the Google Play Store are oriented more toward entertainment than toward business.
Categories and Rating

```python
plt.figure(figsize=(18, 6))
sns.barplot(x='Category', y='Rating', data=df)
plt.xticks(rotation=90)
plt.title('Application Category vs. Average Rating', fontsize=15)
plt.xlabel('Application Category', fontsize=12)
plt.ylabel('Rating', fontsize=12)
plt.ylim(0, 5)
plt.show()
```

Every category has an average rating above 4, except Dating apps.

```python
plt.figure(figsize=(18, 6))
sns.scatterplot(x='Category', y='Rating', data=df, s=40)
plt.xticks(rotation=90)
plt.title('Application Category vs. Rating (Scatter)', fontsize=15)
plt.xlabel('Application Category', fontsize=12)
plt.ylabel('Rating', fontsize=12)
plt.ylim(0.5, 5.5)
plt.show()
```

Ratings in every category mostly fall in the 3-5 range, but the plot suggests that users tend to dislike Finance, Lifestyle and Tools applications.
Types of application

```python
x = df['Type'].value_counts().index
y = df['Type'].value_counts().values
fig = px.pie(data_frame=df, names=x, values=y, width=800, height=600,
             title='Types of application')
fig.update_traces(textposition='inside', textinfo='label+percent')
fig.show()
```

92.6% of the applications are free, and 7.38% must be paid for.
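The percentages shown in the pie chart can also be read off numerically. A minimal sketch, assuming a small invented `types` series rather than the real `Type` column:

```python
import pandas as pd

# Toy series, invented for illustration only: 9 free apps and 1 paid app.
types = pd.Series(["Free"] * 9 + ["Paid"], name="Type")

# normalize=True returns proportions instead of raw counts.
share = types.value_counts(normalize=True) * 100

print(share.round(2))
```

Applied to `df['Type']`, the same call should give the shares that the chart labels display.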
```python
# Drop rows with missing values
df.dropna(inplace=True)
data = df.copy()
data.head()
```

Convert all categorical data to ordinal codes
```python
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

# Label-encode every column, keeping one fitted encoder per column
d = defaultdict(LabelEncoder)
data = data.apply(lambda x: d[x.name].fit_transform(x))
data.head()
```

```python
# K-Means
X = data[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
          'Content Rating', 'Genres']]
model = KMeans(n_clusters=5)
y_kmeans = model.fit_predict(X)  # fit_predict fits the model, so a separate fit() call is unnecessary
```

```python
df['Predicted_Cluster'] = y_kmeans

# Visualize all the clusters
plt.figure(figsize=(15, 8))
sns.scatterplot(x='Category', y='Rating', hue='Predicted_Cluster', data=df, s=50)
plt.xticks(rotation=90)
plt.show()
```

#### Interpretation

As ratings are not normally distributed, the Mann-Whitney U test was applied to test the significance of the rating differences between free and paid apps, since this test does not assume normality. At the 0.05 significance level, the U tests on the individual categories show that free and paid apps have significantly different ratings in the following categories: PERSONALIZATION, TOOLS, FAMILY and GAME. Paid apps are on average rated higher than free apps in these categories.
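To make the test statistic concrete, the sketch below computes the Mann-Whitney U statistic by hand from rank sums, using midranks for ties. The function name `mann_whitney_u` and the sample ratings are hypothetical, made up for illustration; the analysis itself relies on `scipy.stats.mannwhitneyu`, which also supplies the p-value.

```python
from itertools import chain


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples, using midranks for tied values."""
    pooled = sorted(chain(x, y))

    # Collect the 1-based positions of each distinct value in the pooled sample.
    positions = {}
    for pos, value in enumerate(pooled, start=1):
        positions.setdefault(value, []).append(pos)

    # A tied value gets the average of the positions it occupies (its midrank).
    midrank = {value: sum(p) / len(p) for value, p in positions.items()}

    rank_sum_x = sum(midrank[value] for value in x)
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Toy ratings, invented for illustration only (not from the Kaggle dataset).
paid = [4.5, 4.7, 4.4, 4.6]
free = [4.1, 4.4, 4.0, 4.2]

u_paid, u_free = mann_whitney_u(paid, free)
print("U (paid):", u_paid, "U (free):", u_free)
```

U for the paid sample counts, over all paid/free pairs, how often the paid rating exceeds the free one, with ties counting one half. Only the ordering of the ratings enters the computation, which is why no normality assumption is needed.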
### Concluding Remarks

Data analysis was conducted on the Kaggle Google Play Store dataset, and answers to the three questions were explored:

- How do the ratings differ between paid and free apps in general? In general, paid apps are better rated than free apps, which appears to support the argument that the service quality of paid apps is better.
- How are the differences distributed across different app categories? In most categories, paid apps achieve higher ratings than free apps. However, in a few categories such as COMMUNICATION, FINANCE and PHOTOGRAPHY, the average ratings of free apps are higher than those of paid apps. Is this because many popular apps in these categories are free, like Facebook and WhatsApp in the COMMUNICATION category?
- Are there any categories where the differences are statistically significant? There are four categories (PERSONALIZATION, TOOLS, FAMILY and GAME) where paid apps are rated significantly higher than free apps.

This is only a very superficial exploration of the Google Play Store dataset. There is much other useful information, including installation counts and app review texts, which might reveal many more interesting facts and awaits further exploration.